Prom Massively Parallel Algorithms and Fluctuating Time Horizons to 

Non-equilibrium Surface Growth 
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We study the asymptotic scaling properties of a massively parallel algorithm for discrete-event 
simulations where the discrete events are Poisson arrivals. The evolution of the simulated time hori- 
zon is analogous to a non-equilibrium surface. Monte Carlo simulations and a coarse-grained approx- 
imation indicate that the macroscopic landscape in the steady state is governed by the Edwards- 
Wilkinson Hamiltonian. Since the efficiency of the algorithm corresponds to the density of local 
minima in the associated surface, our results imply that the algorithm is asymptotically scalable. 

PACS numbers: 89.80.+h, 02.70.Lq, 05.40.-a, 68.35. Ct 



To efficiently utilize modern supercomputers requires 
massively parallel implementations of dynamic algo- 
rithms for various physical, chemical, and biological pro- 
cesses. For many of these there are well-known and rou- 
tinely used serial Monte Carlo (MC) schemes which are 
based on the realistic assumption that attempts to up- 
date the state of the system form a Poisson process. The 
parallel implementation of these dynamic MC algorithms 
belongs to the class of parallel discrete-event simulations, 
which is one of the most challenging areas in parallel 
computing Q and has numerous applications not only 
in the physical sciences, but also in computer science, 
queueing theory, and economics. For example, in lattice 
Ising models the discrete events are spin-flip attempts, 
while in queueing systems they are job arrivals. Since 
current special- or multi-purpose parallel computers can 
have 10 4 — 10 5 processing elements (PE) it is essen- 
tial to understand and estimate the scaling properties of 
these algorithms. 

In this Letter we introduce an approach to investigate 
the asymptotic scaling properties of an extremely robust 
parallel scheme M. This parallel algorithm is applica- 
ble to a wide range of stochastic cellular automata with 
local dynamics, where the discrete events are Poisson ar- 
rivals. Although attempts have been made to estimate 
its efficiency under some restrictive assumptions H, the 
mechanism which ensures the scalability of the algorithm 
in the "steady state" was never identified. Here we ac- 
complish this by noting that the simulated time hori- 
zon is analogous to a growing and fluctuating surface. 
The local random time increments correspond to the de- 
position of random amounts of "material" at the local 
minima of the surface. This correspondence provides a 
natural ground for cross-disciplinary application of well- 
known concepts from non-equilibrium surface growth |j| 
and driven systems [|| to a certain class of massively par- 



allel computational schemes. To estimate the efficiency 
of this algorithm one must understand the morphology of 
the surface associated with the simulated time horizon. 
In particular, the efficiency of this parallel implementa- 
tion (the fraction of the non-idling processing elements) 
exactly corresponds to the density of local minima in the 
surface model. We show that the steady-state behavior of 
the macroscopic landscape is governed by the Edwards- 
Wilkinson (EW) Hamiltonian |7j , implying that the den- 
sity of the local minima does not vanish when the number 
of PEs goes to infinity. This ensures that the simulated 
time horizon propagates with a non-zero average veloc- 
ity in the steady state. Thus the algorithm is asymp- 
totically scalable! Further, based on the strong analogy 
between the evolution of the simulated time horizon and 
the single-step surface growth model [|| , we describe the 
asymptotic scaling properties of the parallel scheme. 

The difficulty of parallel discrete-event simulations is 
that the discrete events (update attempts) are not syn- 
chronized by a global clock. The traditional dynamic MC 
algorithms were long believed to be inherently serial, i.e., 
in spin language, the corresponding algorithm could at- 
tempt to update only one spin at a time. Lubachevsky 
nevertheless presented an approach for parallel simula- 
tion of these systems || without changing the dynamics of 
the underlying model |J] . Applications of his scheme in- 
clude modeling of cellular communication networks |L2| , 
ballistic particle deposition p3| , and metastability and 
hysteresis in kinetic Ising models jl4| . 

Here we consider the case of one-dimensional systems 
with only nearest-neighbor interactions (e.g., Glauber 
spin-flip dynamics) and periodic boundary conditions. 
We restrict ourselves to the case where each PE car- 
ries one site (e.g., one spin) of the underlying system. 
Non-zero efficiency for this algorithm implies non-zero 
efficiency for the case where a PE carries a block of sites 
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. Also, the stochastic model for the simulated time 
horizon is an exact mapping only for the synchronous al- 
gorithm, where the main simulation cycles are executed 
in lock-step on each PE. Our goal is to show for this 
one-site-per-PE synchronous algorithm (which can be re- 
garded as the worst-case scenario) that the efficiency does 
not go to zero as the number of PEs goes to infinity. The 
basic synchronous parallel scheme Q is as follows. 

The size of the underlying system, and thus the number 
of PEs, is L. Update attempts at each site are indepen- 
dent Poisson processes with the same rate, independent 
of the state of the underlying system. Hence, the random 
time interval between two successive attempts is expo- 
nentially distributed. Without loss of generality we use 
time increments of mean one (in arbitrary units). The 
Poisson arrivals correspond to attempted instantaneous 
changes in the state of the site. In the parallel algorithm 
each PE generates its own local simulated time for the 
next update attempt, denoted by Tj(t), i — 1,2, ... , L. 
Here, t is the discrete index of the parallel steps simul- 
taneously performed by each PE. Initially Ti(0) = for 
each site, and an initial configuration for the underly- 
ing system is specified. The simulated time of the first 
update attempt is determined by t,(1) = Ti(0) + 7^(0), 
where {rji} are independent exponential variables. For 
parallel steps t > 1, each PE must compare its local sim- 
ulated time to the local simulated times of its neighbors. 
If Ti{t) < min{r i _i(t), r i+ i(t)}, the change of state of 
the site is attempted (and decided by the rules of the 
underlying system) , and its local simulated time is incre- 
mented by an exponentially distributed random amount, 
Ti(t + 1) = Ti(t) + rji(t). Otherwise, the change of state 
is not attempted and the local simulated time remains 
the same, Ti(t + 1) = Ti(t), i.e., the PE waits ("idles"). 
The comparison of the nearest-neighbor simulated times, 
and waiting if necessary, ensures that information passed 
between PEs does not violate causality. The algorithm is 
obviously free from deadlock, since at worst the PE with 
the absolute minimum local time can make progress. Af- 
ter the initial step, the probability density of the simu- 
lated time horizon {^(i)} is a continuous measure, so the 
probability that updates for two nearest-neighbor sites 
are attempted at the same simulated time, is of measure 
zero. When modeling the efficiency, we ignore communi- 
cation times between PEs, since they typically contribute 
to a scalable overhead. Thus, the efficiency is simply the 
fraction of non-idling PEs (inherent utilization). This ex- 
actly corresponds to the density of local minima of the 
simulated time horizon. 

The question naturally arises: Is it possible that the 
fraction of non-idling PEs goes to zero in the L—^oo limit? 
This would obviously make the algorithm unscalable and 
the performance of the actual implementation poor if not 
disastrous. To study this problem, we focus on the evolu- 
tion of the simulated time horizon {ri(t)}, which is com- 
pletely independent of the underlying model. The above 



algorithmic steps can be compactly summarized as: 

n(t + l) = Tt (t) (1) 
+ e (n-i(t) - n{t)) o (r m (t) - n{t)) ^(t) . 

The r\i (t) are drawn from an exponential distribution in- 
dependently at every time t and site i, and independent 
of {ri(t)}. Here ©(•) is the Heaviside step- function. This 
stochastic evolution model is very simple and easily sim- 
ulated on a serial computer. Alternatively, one can con- 
sider the evolution of the local slopes, fa = Ti — Ti—f. 



fa(t + 1) - fa(t) = e (-&(*)) e (fa +1 (t)) r)i(t) 
-e(-&_i(t))e (&(<)) 



(2) 



The periodic boundary conditions for {t^} impose the 
constraint Yli=i 4>% = 0- ln this representation the oper- 
ator for the density of local minima is 



u(t) 



I^e(-^(t))6(^ +1 (*)) 



(3) 



The process described by (g) is a microscopic realization 
of biased diffusion Q . The random amount of material 
"deposited" at a local minimum n corresponds to the 
transfer of this amount from fa+i to fa. Since the noise 
in (^|) is independent of {fa(t}}, the average can be sim- 
ply taken. This yields a transparent continuity equation, 
(fa(t + 1)} - (fa{t)) = -((ji) - (ji-i)), where the aver- 
age current is (ji) — —(Q(—fa)Q(fa + i)). Translational 
invariance implies that (u) = (Q(—fa)Q(fa+i)), which is 
the same as the magnitude of the average current or the 
mean velocity of the surface. 

To gain some insight into the evolution of the surface, 
we perform a naive coarse-graining by taking an ensem- 
ble average on (||) and replacing 0(0) with a smooth 
representation. The procedure is independent of the 
actual form of the representation. We use O K (0) = 
(l/2)[tanh(0//c)+l] so lim K ^ ©«(<£) = ©(<£)■ To leading 
order in 61k, 



,(t + 1)) - (fa(t)) = —(fa+i - 2fa + fa_i) 
4k 



(4) 



Taking the naive continuum limit one obtains 
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(•5) 



for the coarse-grained field, fa where roughly speaking 
A, the coefficient of the nonlinear term, carries the de- 
tails of the coarse-graining procedure. Equation (p) is 
the nonlinear biased diffusion or Burgers' equation [151. 
Through 6 = df/dx it is simply related to the deter- 
ministic part of the KPZ equation for the coarse-grained 
surface height fluctuations Jl6[, 
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To capture the fluctuations one typically extends the 
above equations with appropriate noise, i.e., conserved 
for (|J), and non-conserved for (^). This implies that 
the evolution of the simulated time horizon is KPZ- 
like. In one dimension the steady state of such sys- 
tems (on coarse-grained length scales) is governed by 
the EW Hamiltonian 0, 1i.EW c ^Jdx(dT/dx) 2 . This cor- 
responds to a simple random-walk surface, where the 
coarse-grained slopes are independent in the thermody- 
namic limit, yielding 1/4 for the density of local min- 
ima. Obviously this value will be different for our spe- 
cific microscopic model. However it cannot vanish: a zero 
density of local minima in the L—kx> limit would imply 
that it is zero at all length scales. This would contradict 
our finding that the steady state at the coarse-grained 
level is governed by the EW Hamiltonian. The non-zero 
density of local minima is an important characteristic of 
this (steady-state) universality class. It ensures that our 
specific microscopic surface propagates with a non-zero 
average velocity in the steady state. Models belonging 
to other universality classes do not necessarily have non- 
zero extremal-point densities. For example, we can show 
fl7f that the density of local minima vanishes for a one- 
dimensional curvature-driven random Gaussian surface. 

We now present our MC results to test the coarse- 
graining approach. First we follow the time evolution 
of the width (w 2 (t)) = (l/L)(J2ti [n(t)-T(t)} 2 ), where 
f(t)=(l/L) X)i=i r i(*) an d the average (•) is taken over 
many independent runs. After the early-time regime, 
which is strongly affected by the intrinsic width, and be- 
fore saturation, we find (w 2 (t))~t 2 P [Fig. 1(a)]. Although 
the system exhibits very strong corrections to scaling, 
for our largest system, L=10 5 , we find /3=0.326 ± 0.005, 
which includes within two standard errors the exact KPZ 
exponent, 1/3. In the steady state the width is stationary 
and (w 2 )~L 2a for large L. Here the corrections to scaling 
are somewhat smaller than in the earlier regime, and the 
above scaling is obeyed for L>10 3 with ev=0.49 ± 0.01. 
This agrees with the prediction that the long-distance 
behavior is governed by the EW Hamiltonian, for which 
a=l/2. Further, plotting rescaled width (w 2 ) / L 2a vs 
rescaled time t/L z , with z=a/f3, confirms dynamic scal- 
ing for the intermediate-to-late time crossover jl8| [Fig. 
1(a) inset]. We also measured the average steady-state 
structure factor of {r^}, finding the expected ~ 1/k 2 be- 
havior for small wave vectors k. The spatial two-point 
correlations decay linearly for {t^} and are short ranged 
for the slopes, {&}■ 

To further probe the universal properties of the sur- 
face in the steady state we construct the full width dis- 
tribution, P(w 2 ). The EW class is characterized by 
a universal scaling function, &(x), such that P(w 2 ) = 
{w 2 )- 1 ^(w 2 /(w 2 )) p|, where 



Systems with L>10 3 show convincing data collapse [Fig. 
1(b)] onto this exact scaling function. 
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FIG. 1. (a) Time evolution and dynamic scaling (inset) for 
the surface width, (b) Steady-state width distribution (inset: 
on linear-log scales). 

Next we estimate the efficiency of the algorithm. When 
simulating the system described by (Q) and measuring 
the average local minimum density (u(t)}, we observe 
that for every system size it monotonically decreases as 
a function of time and approaches a constant slightly 
smaller than 1/4 for large systems. Using the close sim- 
ilarity between our model and the single-step solid-on- 
solid surface-growth model we can understand the 
scaling behavior for the steady-state average (u) and the 
fluctuations tr 2 = (u 2 ) — (u) 2 . In the single-step model 
the height differences (i.e., the local slopes) are restricted 
to ±1, and the evolution consists of particles of height 2 
being deposited at the local minima. The advantage of 
this model is that it can be mapped onto a hard-core 
lattice gas for which the steady-state probability dis- 
tribution of the configurations is known exactly |j§[p0|. 
This enables one to find arbitrary moments of the lo- 
cal minimum density operator, analogous to (||). For the 
single-step model we find that (u)l — (1/4) (1 — 1/L)^ 1 — 
1/4 + 1/(4L) + 0(l/L 2 ), and a\ = 1/(16L) + 0(1/L 2 ). 
We propose that the scaling of the local minimum density 
for large L in our model follows the same form, i.e., 



u 



'oo 



(8) 



Our reasons are: (?) both models in their steady states 
belong to the EW universality class (short-range corre- 
lated local slopes), which guarantees that (it) oo is non- 
zero; (it) the constraint Y^,t=i 'Pi = in our model 
and the conservation of the number of particles in the 
lattice gas will produce similar finite-size effects. Our 
simulation results show very good agreement with m) 
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[Fig. 2(a)]. Further, extrapolating to L^oo, yields 
(m)oo=0.24641 ± (7 x 1CT 6 ). The scaling relation (g) also 
implies that u is a self-averaging macroscopic quantity: 
its full distribution Pr, (u) , for large L, is Gausian with the 
parameters in (JsJ) [Fig. 2(b)]. Thus, in the L^oo limit it 
approaches a delta-function centered about (tt)oo- 
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FIG. 2. Steady-state scaling for the parallel efficiency (den- 
sity of local minima), (a) The average and the variance (in- 
set), (b) The full probability density. Inset: scaled probabil- 
ity densities, Pl(u) with u = (u — {v)l)/<tl, collapse onto a 
Gaussian with zero mean and unit variance. 

Finally, we point out the lack of up-down symmetry in 
our model in the steady-state. This is most easily notice- 
able at short distances, either by looking at a snapshot 
[Fig. 3(a)], or through the high degree of asymmetry in 
the nearest-neighbor two-slope distribution [Fig. 3(b)]: 
the hilltops are sharp and the valley-bottoms are flat- 
tened. Such stationary-state skewness is generally ob- 
servable in one-dimensional KPZ growth, but has only 
recently received serious attention M ,B3] . 





FIG. 3. (a) A short segment (50 sites) of a typical 
steady-state surface configuration, (b) Density plots for the 
nearest-neighbor two-slope distribution, L—lQ i . 

In summary, we studied the asymptotic scaling prop- 
erties of a general parallel algorithm by regarding the 
simulated time horizon as a non-equilibrium surface. We 
conclude that the basic algorithm (one site per PE) is 
scalable for one-dimensional arrays. The same corre- 
spondence can be applied to model the performance of 
the algorithm for higher-dimensional logical PE topolo- 



gies. While this will involve the typical difficulties of 
surface-growth modeling, such as an absence of exact re- 
sults and very long simulation times, it establishes po- 
tentially fruitful connections between two traditionally 
separate research areas. 
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